The People's Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet
Abstract
We propose a method to automatically align WordNet synsets and Wikipedia articles to obtain a sense inventory of higher coverage and quality. For each WordNet synset, we first extract a set of Wikipedia articles as alignment candidates; in a second step, we determine which article (if any) is a valid alignment, i.e., is about the same sense or concept. In this paper, we go significantly beyond state-of-the-art word overlap approaches and apply a threshold-based Personalized PageRank method for the disambiguation step. We show that WordNet synsets can be aligned to Wikipedia articles with a performance of up to 0.78 F1-Measure based on a comprehensive, well-balanced reference dataset consisting of 1,815 manually annotated sense alignment candidates. The fully aligned resource and the reference dataset are publicly available.
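The two-step procedure in the abstract (candidate extraction, then a thresholded Personalized PageRank decision) can be illustrated with a small sketch. The abstract does not specify the graph, the seed selection, the similarity measure, or the threshold value, so everything below is an assumption made for illustration: a toy word graph, hypothetical article titles, cosine similarity between Personalized PageRank vectors, and an arbitrary threshold. It is not the authors' implementation.

```python
import math

def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration Personalized PageRank on {node: [neighbors]}."""
    nodes = sorted(graph)
    # Teleport distribution: uniform over the seed nodes, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if neighbors:
                share = damping * rank[n] / len(neighbors)
                for m in neighbors:
                    new_rank[m] += share
            else:
                # Dangling node: send its mass back to the seed nodes.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
        rank = new_rank
    return rank

def cosine(u, v):
    """Cosine similarity of two rank vectors defined over the same nodes."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values()))
    norm *= math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def score_candidates(graph, synset_seeds, candidates):
    """Similarity between the synset's PPR vector and each candidate's."""
    synset_vec = personalized_pagerank(graph, synset_seeds)
    return {title: cosine(synset_vec, personalized_pagerank(graph, seeds))
            for title, seeds in candidates.items()}

def align(graph, synset_seeds, candidates, threshold):
    """Return the best candidate whose score reaches the threshold, else None."""
    best, best_score = None, threshold
    for title, score in score_candidates(graph, synset_seeds, candidates).items():
        if score >= best_score:
            best, best_score = title, score
    return best

# Toy, hand-made word graph (hypothetical; not WordNet data).
graph = {
    "bank": ["money", "river"],
    "money": ["bank", "deposit", "loan"],
    "deposit": ["money", "loan"],
    "loan": ["money", "deposit"],
    "river": ["bank", "shore", "water"],
    "shore": ["river", "water"],
    "water": ["river", "shore"],
}
# Seed words standing in for a synset's gloss and for each article's text.
synset_seeds = {"bank", "money", "deposit"}
candidates = {
    "Bank": {"bank", "loan", "money"},
    "Bank (geography)": {"bank", "river", "shore"},
}
```

Calling `align(graph, synset_seeds, candidates, threshold=0.0)` selects the finance-related candidate on this toy graph, while a threshold that no candidate reaches yields `None`, which corresponds to the "if any" case in the abstract. In practice the threshold would have to be tuned on annotated data.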
Similar resources
WordNet―Wikipedia―Wiktionary: Construction of a Three-way Alignment
The coverage and quality of conceptual information contained in lexical semantic resources is crucial for many tasks in natural language processing. Automatic alignment of complementary resources is one way of improving this coverage and quality; however, past attempts have always been between pairs of specific resources. In this paper we establish some set-theoretic conventions for describing ...
Word Sense Disambiguation Using Wikipedia
This paper describes explorations in word sense disambiguation using Wikipedia as a source of sense annotations. Through experiments on four different languages, we show that the Wikipedia-based sense annotations are reliable and can be used to construct accurate sense classifiers.
Not Just Bigger: Towards Better-Quality Web Corpora
For the acquisition of common-sense knowledge as well as a way to answer linguistic questions regarding actual language usage, the breadth and depth of the World Wide Web has been welcomed to supplement large text corpora (usually from newspapers) as a useful resource. While purists' criticism of unbalanced composition or text quality is easily shrugged off as unconstructive, empirical resul...
Automatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of the English language in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
Understanding Users Intent by Deducing Domain Knowledge Hidden in Web Search Query Keywords
Search engines are used by people on a daily basis to retrieve information from the web. When an ambiguous word is present in a query, the specific sense of the keyword is not considered during the search process. Search engines return a large number of web pages as results from all the possible contexts. Users tend to browse only a few pages. Improving quality of retrieved results is a challenge and...